Abstract

Tests are essential in Information Retrieval and Data Mining in order to evaluate the effectiveness of a query. An automatic measurement tool intended to exhibit the meaning of words in context has been developed and linked with Quantum Theory, particularly entanglement. Quantum-like experiments were undertaken on a semantic space based on the Hyperspace Analogue to Language (HAL) method. A quantum HAL model was implemented using state vectors issued from the HAL matrix and query observables, testing a wide range of window sizes. The Bell parameter $S_{query}$, associating measures on two words in a document, was derived, showing peaks for specific window sizes. The peaks show maximum quantum violation of the Bell inequalities and are document dependent. This new correlation measure inspired by Quantum Theory could be promising for measuring query relevance.

Keywords: Bell inequality, entanglement, information retrieval, co-occurrence, HAL, tests, context, IR algorithms, Quantum Theory

1 Introduction 

In this work we present original quantum-like tests that could be useful in the domain of Information Retrieval (IR) and Data Mining.

Context is used to disambiguate terms. Melucci [1] showed that a query or a document can be generalized, in different contexts, as vectors, where the likelihood of context of a set of documents can be considered. Quantum Mechanics (QM) has been invoked to enrich the search capabilities in IR by Rijsbergen [2] by using the mathematical formalism of the Hilbert vector space.

Analogies between concepts issued from Quantum Theory and Information Retrieval tools have been made by several authors. Widdows [3] uses the quantum formalism for experiments with negation and disjunction, and Arafat [4] shows that user needs can be represented by a state vector. Other analogies have been stated by Li & Cunningham [5], such as: state vector/objects in a collection; observable/query; eigenvalues/relevance or not for one object; probability of getting one eigenvalue/relevance degree of an object to a query. Bruza and Cole explicitly calculated eigenvectors associated to a word [6], and in the field of concept representation associated with the CHSH inequality an explicit Bell inequality violation was found [7].

2 Bell Inequality and Bell Parameters for binary outcomes 

Entanglement, which can be made manifest through Bell inequality violations [8] (commonly presented in the form of the CHSH inequality [9]), has become a very important research trend in Physics. Several experiments have proved the existence of entangled particles [10, 11] and this fact is now widely accepted. This field has fascinated many scientists throughout the last decades, also leading to much parallel scientific and pseudoscientific research, as is well described by Keiser in a recent book [12]. Part of the attraction arises from the concept of nonlocality of the quantum world, suggesting spooky action at a distance (a discussion can be found in [13]). Even though in general the violation of Bell inequalities demands entanglement, higher violations of the inequalities do not necessarily mean more entanglement.

Quantum Information has emerged bridging physics and information science. Though initially discovered in the context of the foundations of Quantum Mechanics, the violations of Bell inequalities referred to above are nowadays a key point in a wide range of branches of Quantum Information. Entanglement is at the heart of this field because it is seen as a potential resource for new applications such as coding or computing [14]. Further theorems of the Bell type, known as no-go theorems (for example the Kochen-Specker theorem [15]), have been proposed.

In practice most experiments have used polarized photons, as in the famous 1982 experiment by Aspect et al. [11]. More sophisticated set-ups are constantly being proposed and discussed [16, 17], very often to rule out local hidden variable models.

Some macroscopic tests have been proposed in the form of thought experiments or combined with yes-no questions, also showing Bell inequality violations, for example by Aerts [18].

The CHSH-Bell parameter $S_{Bell}$ for tests with two binary outcomes, $+1$ or $-1$, can be defined as follows:

$$ S_{Bell} = |E(A, B) - E(A, C)| + |E(B, D) + E(C, D)| \tag{1} $$

where $A$, $B$, $C$ and $D$ are tests and $E(X, Y)$ stands for the expectation value of the outcome of mutual tests $X$ and $Y$. We briefly recall some important facts about the Bell parameter. It is easily verified that $S_{Bell}$ can never exceed 4. More specifically, the Bell inequality is derived for the so-called classical, local and separable situation, where $S_{Bell}$ lies between 0 and 2. In this case, for example, we could write $E(X, Y) = E(X) E(Y)$. The case $2 < S_{Bell} \le 2\sqrt{2}$ can be achieved with quantum entangled states obtained experimentally with photons. Less emphasized is the case $S_{Bell} > 2\sqrt{2}$, beyond the value $2\sqrt{2}$ known as Tsirelson's bound [16, 19]. The zone between $2\sqrt{2}$ and 4 is called the no-signaling region. The maximum value $S_{Bell} = 4$ can be attained with logical probabilistic constructions often named PR boxes [20].
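As an illustration of the two regimes recalled above, the following minimal Python sketch evaluates the CHSH parameter of Eq. (1) for a two-qubit singlet state and for a trivial local model. It is not part of the experiments reported here; the choice of the singlet state, the measurement angles and all function names are assumptions made for the example.

```python
import numpy as np

SX = np.array([[0.0, 1.0], [1.0, 0.0]])   # Pauli sigma_x
SZ = np.array([[1.0, 0.0], [0.0, -1.0]])  # Pauli sigma_z

def spin(theta):
    """Binary (+1/-1) observable along angle theta in the x-z plane."""
    return np.cos(theta) * SZ + np.sin(theta) * SX

# Singlet state (|01> - |10>)/sqrt(2) in the basis |00>, |01>, |10>, |11>.
SINGLET = np.array([0.0, 1.0, -1.0, 0.0]) / np.sqrt(2.0)

def e_quantum(x, y):
    """E(X, Y): joint expectation value on the singlet, equal to -cos(x - y)."""
    return float(SINGLET @ np.kron(spin(x), spin(y)) @ SINGLET)

def e_classical(x, y):
    """A trivial local model in which every test always returns +1."""
    return 1.0

def chsh(E, A, B, C, D):
    """S = |E(A,B) - E(A,C)| + |E(B,D) + E(C,D)|, as in Eq. (1)."""
    return abs(E(A, B) - E(A, C)) + abs(E(B, D) + E(C, D))

# Assumed measurement angles (radians) that maximize S for the singlet.
angles = dict(A=0.0, B=np.pi / 4, C=3 * np.pi / 4, D=np.pi / 2)
s_quantum = chsh(e_quantum, **angles)      # reaches Tsirelson's bound 2*sqrt(2)
s_classical = chsh(e_classical, **angles)  # stays at the classical limit 2
```

With these angles the singlet gives $2\sqrt{2}$, while the local model cannot exceed 2.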



3 Bell Tests in Semantic Space using HAL 

The approach presented here can be perceived as an experiment done on objects outside the domain of physics. The objects are words within texts. We study the relationships between words within a document; these relationships can be formed by creating a semantic space using the Hyperspace Analogue to Language (HAL) method [21].

The HAL algorithm does not require any explicit human a priori judgment. In this procedure a HAL lexical co-occurrence matrix is built with a "window" representing a span of words passed over the corpus being analyzed. The width of this window can be varied. Words within the window are recorded as co-occurring with a strength that decreases with the number of other words separating them within the window.
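The window procedure can be sketched in a few lines of Python. This is only an illustrative sketch: the linear weighting `window - d + 1` (the nearest neighbour scores `window`, the farthest scores 1) is the usual HAL convention and is assumed here, and the function name `hal_matrix` is hypothetical rather than taken from our implementation.

```python
import numpy as np

def hal_matrix(words, window):
    """Return (vocab, H), where H[i, j] accumulates the score of vocab[j]
    occurring within `window` positions before vocab[i]."""
    vocab = sorted(set(words))
    index = {w: k for k, w in enumerate(vocab)}
    H = np.zeros((len(vocab), len(vocab)))
    for pos, word in enumerate(words):
        for d in range(1, window + 1):
            if pos - d < 0:
                break
            # Assumed weighting: closer words receive a higher score.
            H[index[word], index[words[pos - d]]] += window - d + 1
    return vocab, H

# Toy sentence with a window of 5 words.
words = "the horse raced past the barn fell".split()
vocab, H = hal_matrix(words, window=5)
barn_row = dict(zip(vocab, H[vocab.index("barn")]))
```

In this toy sentence the row for "barn" scores "the" with 6, because "the" occurs at distances 1 and 5 and the two contributions (5 and 1) add up.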



The point of the co-occurrence matrix is that the rows effectively constitute vectors in a high-dimensional space, so that the elements of the normalized vectors are frequency counts (probabilities), and the dimensionality of the space is determined by the number of columns in the matrix (context vectors).

The HAL method has already been used as a tool for a physical analogy between semantic space and Quantum Theory, where each word was associated with a given spectrum (in analogy with the spectral emission lines of atoms) [22].

Our method uses the HAL matrices for calculating quantum mechanical mean values of Query Observables and combining them in order to calculate a Bell parameter $S_{query}$.

We carried out our tests on a symmetric matrix obtained as the sum of the HAL matrix and its transpose (equivalent to also running HAL backwards). This is because we did not consider the order in which words appear in a text.

The tests were carried out on documents in English. The programming scheme of the algorithm 
implementation is represented in Figure 3 (Appendix). 

4 Quantum model for Bell Tests using HAL 

In this section we intend to define operators, in analogy with Quantum Theory, that will give a 
new possible approach to document queries. We make the following definitions. 

4.1 Document vector states 

Each document will have an associated vector. The vector state of the document is the normalized linear sum of all the words it contains. Each word vector state is extracted from the normalized rows of the symmetric HAL matrix. It is defined as:

$$ |\Psi\rangle = \frac{\sum_i |w_i\rangle}{\sqrt{\sum_{j,k} \langle w_j | w_k \rangle}} \tag{2} $$
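A minimal numerical sketch of this definition follows, assuming a small made-up HAL matrix `H` and hypothetical helper names. It symmetrizes the matrix, normalizes each row to obtain the word states, and divides their sum by the square root of the summed overlaps $\langle w_j | w_k \rangle$.

```python
import numpy as np

def word_states(H):
    """Rows of the symmetric HAL matrix (H + H^T), each normalized to unit length."""
    sym = H + H.T
    norms = np.linalg.norm(sym, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows untouched
    return sym / norms

def document_state(W):
    """Normalized sum of the word states, with norm sqrt(sum_jk <w_j|w_k>)."""
    total = W.sum(axis=0)
    gram = W @ W.T  # gram[j, k] = <w_j|w_k>
    return total / np.sqrt(gram.sum())

# Made-up 3-word HAL matrix, used only to exercise the functions.
H = np.array([[0.0, 2.0, 1.0],
              [0.0, 0.0, 3.0],
              [1.0, 0.0, 0.0]])
W = word_states(H)
Psi = document_state(W)
```

The denominator is simply the norm of the summed vector, so `Psi` has unit length by construction.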

We will be interested in analyzing how two words, namely word A and word B, are connected within a document. The two word vectors define a plane in the semantic space. We will not consider the part of the document state that is orthogonal to the plane spanned by the two words we are interested in; only the projection of the document state vector onto this plane is considered. The resulting state vector $|\psi\rangle$ will from now on be the document vector state. This vector can be written in two different orthogonal bases: $\{|w_A\rangle, |w_{A\perp}\rangle\}$ and $\{|w_B\rangle, |w_{B\perp}\rangle\}$. The orthogonal component vectors are obtained by the Gram-Schmidt orthogonalization process. The document vector state then takes the form:

$$ |\psi\rangle = \alpha |w_A\rangle + \alpha_\perp |w_{A\perp}\rangle = \beta |w_B\rangle + \beta_\perp |w_{B\perp}\rangle \tag{3} $$

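The coefficients $\alpha$, $\alpha_\perp$, $\beta$ and $\beta_\perp$ are obtained by projecting the original state $|\Psi\rangle$ onto the basis vectors and normalizing them so that the new $|\psi\rangle$ is normalized to unity. For example, for $\alpha$ we have:

$$ \alpha = \frac{\langle w_A | \Psi \rangle}{\sqrt{\langle w_A | \Psi \rangle^2 + \langle w_{A\perp} | \Psi \rangle^2}} \tag{4} $$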
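The projection step can be sketched as follows. This is an illustration under stated assumptions: the three-dimensional vectors are made up, the helper names are hypothetical, and the sign convention of the Gram-Schmidt step (the orthogonal partner of one word is built from the other word) is our choice for the example.

```python
import numpy as np

def orthogonal_partner(w, other):
    """Gram-Schmidt: unit vector in span{w, other} orthogonal to the unit vector w."""
    r = other - (w @ other) * w
    return r / np.linalg.norm(r)

def plane_coefficients(Psi, w, other):
    """Project Psi on (w, w_perp) and rescale so the two coefficients have unit norm."""
    w_perp = orthogonal_partner(w, other)
    a, a_perp = w @ Psi, w_perp @ Psi
    n = np.hypot(a, a_perp)
    return a / n, a_perp / n

# Made-up unit word vectors and document state.
w_A = np.array([1.0, 0.0, 0.0])
w_B = np.array([0.6, 0.8, 0.0])
Psi = np.ones(3) / np.sqrt(3.0)

alpha, alpha_perp = plane_coefficients(Psi, w_A, w_B)
beta, beta_perp = plane_coefficients(Psi, w_B, w_A)

# Both decompositions describe the same projected state |psi>.
psi_from_A = alpha * w_A + alpha_perp * orthogonal_partner(w_A, w_B)
psi_from_B = beta * w_B + beta_perp * orthogonal_partner(w_B, w_A)
```

Both expansions reconstruct the same unit vector in the plane of the two words, which is the consistency expressed by the two decompositions of the document state.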

4.2 Query operators 

We now want to define query operators. The purpose of these operators is to quantify a query within this formalism. The query operators $\hat{A}$ and $\hat{B}$ are defined in such a way that they attribute the value $+1$ to the component of the state that corresponds to the word meaning we are interested in, and $-1$ to the orthogonal direction. More precisely, we will use operators acting as the Pauli spin matrix $\sigma_z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$ on their respective decomposition bases. We can associate these operators with observables because they are Hermitian. Explicitly:

$$ \hat{A} |\psi\rangle = \alpha |w_A\rangle - \alpha_\perp |w_{A\perp}\rangle, \qquad \hat{B} |\psi\rangle = \beta |w_B\rangle - \beta_\perp |w_{B\perp}\rangle \tag{5} $$

Expectation values of operators are calculated in the same way as in the quantum formalism. For example, the mean value in the context of $\psi$ for the query about A is written as usual in quantum mechanics:

$$ \langle \hat{A} \rangle_\psi = \langle \psi | \hat{A} | \psi \rangle = \alpha^2 - \alpha_\perp^2 = 2\alpha^2 - 1 \tag{6} $$

From this example we see that we can obtain a score for the expected value of a search related to word A, which is reasonable as a score of a query in the document since it increases with $\alpha$ (the scalar product between the document state and the word state). The values range from $+1$ to $-1$, being $+1$ when the document vector is parallel to the query vector and $-1$ when it is orthogonal. Following this line of thought, other operators can be defined using, for example, the Pauli matrix $\sigma_x = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$.

For the choice of the query operator $\hat{A}_x = \sigma_x$ this would be:

$$ \hat{A}_x \left( \alpha |w_A\rangle + \alpha_\perp |w_{A\perp}\rangle \right) = \alpha_\perp |w_A\rangle + \alpha |w_{A\perp}\rangle \tag{7} $$

This operator switches the components of the vector state. This can be interpreted as a measure of a different meaning in the document with respect to the original A direction. We do not consider the expectation values for the Pauli spin matrix $\sigma_y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}$, because here the components of the vector state issued from the HAL matrix are always real, so the expectation values associated with this operator are zero. Possible future generalizations may include a way of obtaining vector components in the complex plane.
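These statements are easy to check numerically. The short sketch below uses an arbitrary illustrative value $\alpha = 0.8$ (an assumption, not a measured quantity) and evaluates the mean values of the three Pauli matrices on a real two-component state.

```python
import numpy as np

SZ = np.array([[1, 0], [0, -1]], dtype=complex)
SX = np.array([[0, 1], [1, 0]], dtype=complex)
SY = np.array([[0, -1j], [1j, 0]], dtype=complex)

def expectation(op, psi):
    """<psi|op|psi> for a Hermitian operator (the result is real)."""
    return float(np.real(np.vdot(psi, op @ psi)))

alpha = 0.8                          # illustrative coefficient along |w_A>
alpha_perp = np.sqrt(1 - alpha**2)   # coefficient along the orthogonal direction
psi = np.array([alpha, alpha_perp], dtype=complex)

mean_z = expectation(SZ, psi)  # equals 2*alpha**2 - 1
mean_x = expectation(SX, psi)  # equals 2*alpha*alpha_perp
mean_y = expectation(SY, psi)  # vanishes for real components
```

The mean value of $\sigma_y$ is zero for any real state, which is why it carries no information in the present model.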



4.3 Combining operators and expectation values of two queries 

For technical reasons we choose one of the bases given above in Eq. 5 and write the operators with respect to this basis. If we choose the basis relative to A, we can write the transformation matrix $M$ from the A basis to the B basis. It is easy to see that:

$$ M = \begin{pmatrix} \langle w_B | w_A \rangle & \langle w_B | w_{A\perp} \rangle \\ \langle w_{B\perp} | w_A \rangle & \langle w_{B\perp} | w_{A\perp} \rangle \end{pmatrix} \tag{8} $$

By construction, this matrix can be expressed simply through the parameter $p = \langle w_B | w_A \rangle$, which is always positive and smaller than 1 unless there is a perfect alignment between the two words. The matrix can thus be written as

$$ M = \begin{pmatrix} p & \sqrt{1 - p^2} \\ \sqrt{1 - p^2} & -p \end{pmatrix} \tag{9} $$

Any operator expressed in its matrix form in the basis associated to B can be written in the basis associated to A using the transformation matrix $M$. From our previous definition of $\hat{B}$, its matrix form in the basis associated to A becomes:

$$ \hat{B} = \begin{pmatrix} 2p^2 - 1 & 2p\sqrt{1 - p^2} \\ 2p\sqrt{1 - p^2} & 1 - 2p^2 \end{pmatrix} \tag{10} $$

With the two operators written in a common basis we can now combine two query operators and easily calculate mean values. For example we get:

$$ \langle \hat{A} \hat{B} \rangle = 2p^2 - 1 \tag{11} $$

The expected value of a joint query is then uniquely determined by the inner product of the two word states. Note that even though $\hat{A}\hat{B} \neq \hat{B}\hat{A}$, meaning that the observables $\hat{A}$ and $\hat{B}$ do not commute, we have equality when we take the mean values: $\langle \hat{A}\hat{B} \rangle = \langle \hat{B}\hat{A} \rangle$.
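The change of basis and the joint mean value can be verified with a short numerical sketch. It assumes the transformation matrix $M = \begin{pmatrix} p & \sqrt{1-p^2} \\ \sqrt{1-p^2} & -p \end{pmatrix}$, an arbitrary overlap $p = 0.6$ and an arbitrary real state; the names are illustrative only.

```python
import numpy as np

SZ = np.diag([1.0, -1.0])

def transformation(p):
    """Transformation matrix M for an overlap p = <w_B|w_A>."""
    s = np.sqrt(1.0 - p**2)
    return np.array([[p, s], [s, -p]])

def mean(op, psi):
    """<psi|op|psi> for a real state vector psi."""
    return float(psi @ op @ psi)

p = 0.6                      # illustrative overlap between the two word states
M = transformation(p)
A = SZ                       # query operator A in its own basis
B = M.T @ SZ @ M             # query operator B rewritten in the A basis
psi = np.array([0.8, 0.6])   # any normalized real state in the A basis

mean_AB = mean(A @ B, psi)
mean_BA = mean(B @ A, psi)
```

The two products differ as matrices, but their mean values coincide and equal $2p^2 - 1$ whatever real state is chosen, because the antisymmetric part of the product does not contribute.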

4.4 Bell parameter calculation 

Bell tests are usually a proof of the non-separability of the combination of two different systems. The present case is not completely analogous. In fact we are dealing with only a two-dimensional space, which can be understood as the sense associated to a word A in a document or the sense of another word B. For purposes of document analysis we have chosen an approach leading to the calculation of a Bell parameter as defined in Eq. 1. Concretely, we calculate quantum mean values with different query operators, which can be considered as measuring devices:



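$$ S_{query} = \left| \langle \hat{A} \hat{B}_+ \rangle + \langle \hat{A}_x \hat{B}_+ \rangle \right| + \left| \langle \hat{A} \hat{B}_- \rangle - \langle \hat{A}_x \hat{B}_- \rangle \right| \tag{12} $$

using the following operators:

$$ \hat{A}, \qquad \hat{A}_x, \qquad \hat{B}_+ = \frac{\hat{B} + \hat{B}_x}{\sqrt{2}}, \qquad \hat{B}_- = \frac{\hat{B} - \hat{B}_x}{\sqrt{2}} \tag{13} $$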

Our particular operator choice was inspired by the usual example that maximizes the violation of the Bell inequalities for a particular Bell state of two qubits [14], even though our global model is different. All operators have the property of being their own inverse, that is, their square is the identity (a property of the Pauli matrices), which means that their eigenvalues are $+1$ and $-1$. With this we can calculate the corresponding parameters considering different queries among different documents. Two examples are presented in the next section.
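A minimal sketch of this calculation is given below. It rests on several assumptions that should be read as illustrative: all operators are written in the A basis, $\hat{B}$ and $\hat{B}_x$ are taken as $M \sigma_z M$ and $M \sigma_x M$ with $M = \begin{pmatrix} p & \sqrt{1-p^2} \\ \sqrt{1-p^2} & -p \end{pmatrix}$, and the function name `s_query` is hypothetical.

```python
import numpy as np

SZ = np.diag([1.0, -1.0])
SX = np.array([[0.0, 1.0], [1.0, 0.0]])

def mean(op, psi):
    """<psi|op|psi> for a real state vector psi."""
    return float(psi @ op @ psi)

def s_query(p, psi):
    """Bell parameter built from A, A_x, B_+ and B_- for an overlap p."""
    s = np.sqrt(1.0 - p**2)
    M = np.array([[p, s], [s, -p]])    # assumed transformation matrix
    A, Ax = SZ, SX
    B, Bx = M @ SZ @ M, M @ SX @ M     # B operators written in the A basis
    B_plus = (B + Bx) / np.sqrt(2.0)
    B_minus = (B - Bx) / np.sqrt(2.0)
    return (abs(mean(A @ B_plus, psi) + mean(Ax @ B_plus, psi))
            + abs(mean(A @ B_minus, psi) - mean(Ax @ B_minus, psi)))

psi = np.array([0.8, 0.6])                   # illustrative real document state
s_peak = s_query(1.0 / np.sqrt(2.0), psi)    # overlap p = 1/sqrt(2)
s_aligned = s_query(1.0, psi)                # perfectly aligned word states
```

In this sketch the overlap $p = 1/\sqrt{2}$ returns $2\sqrt{2}$, and perfectly aligned word states ($p = 1$) return zero.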

5 Results and Discussion 

With the formalism established we are in a position to apply it to different documents. We calculated the Bell parameter defined in Eq. 12 and we will discuss the results from a relevance perspective. In the following examples all the documents are taken from Wikipedia.

5.1 Test on Documents: Reagan and Iran 

As a first application we considered an example originally introduced by Bruza and Cole [6], which is the query for the word Reagan in the context of Iran. If we talk about Reagan alone, one usually associates this with the fact that he was President, but if we include Iran it is more likely that we are interested in the Iran-Contra scandal. Four documents close to the query were considered: Reagan administration scandals¹, Reagan², Iran-Contra affair³ and Iran⁴.

We plot the parameter $S_{query}$ defined above in Eq. 12 as a function of the HAL window size for the query of the words Reagan and Iran. The results are shown in Figure 1.

There is clearly a common behavior of the query across the documents (with just one exception): the parameter starts from zero and increases until it reaches a maximum, never crossing Tsirelson's bound $2\sqrt{2}$ but getting very close to it, and then drops again. This suggests that each document, given a two-word query, has an optimal HAL window size that maximizes the parameter $S_{query}$. For the query Iran - Reagan, among the four documents, it is predictable that the document most closely related to this query is Iran-Contra affair, followed by the documents Reagan administration scandals and Reagan, with an expected greater relevance for the first. The least related document should be Iran. At first sight it may appear that, since we are looking for Reagan - Iran, the documents Reagan and Iran should appear on the same level in the search. However, in general, the meaning Reagan has less importance in the context Iran (because the common concept Iran includes its history, culture, geographical situation, etc.) than Iran in the context of Reagan. In Figure 1 we also observe that this is basically the order in which the peaks appear.

Figure 1: Bell parameter for the query of words Reagan - Iran in four documents: Reagan administration scandals, Reagan, Iran-Contra affair and Iran.

¹ http://en.wikipedia.org/wiki/Reagan_administration_scandals (accessed 12/04/2013)
² http://en.wikipedia.org/wiki/Reagan (accessed 12/04/2013)
³ http://en.wikipedia.org/wiki/Iran-Contra_affair (accessed 12/04/2013)
⁴ http://en.wikipedia.org/wiki/Iran (accessed 12/04/2013)

The document regarding Iran always gives a constant value of 2. This fact is easily explained in the framework of our model. In fact it is not hard to see that when we do a two-word query in which one of the words is not present in the document, the result for $S_{query}$ is always 2. Moreover, if neither of the two words is present, the result is always zero. Let us now consider the other three documents.

The first peak appears using a window length around $l = 30$. The $S_{query}$ curve peaks before this value in the document Iran-Contra affair, which corresponds to our previous prediction. In fact, it makes sense that the sooner a peak appears, the less interaction, in the sense of window length, we have to consider in order to obtain high correlation between words. Bearing this in mind, the document Iran-Contra affair is clearly the one selected by the model. The other two documents (Reagan and Reagan administration scandals) are not clearly distinguishable. On one side the peak of the Reagan document appears first, but the curve for Reagan administration scandals stays close to Tsirelson's bound $2\sqrt{2}$ over a wider range, meaning high correlation for several window sizes, which can also be a clue for some strong correlation between words.

5.2 Test on Documents about Orange 

The second case considered concerns the polysemy of the word orange and associated concepts. In this example we are interested in the ambiguity between the meanings color and fruit. We also associate the concept of juice. The documents considered were: Orange (Colour)⁵, Orange (Fruit)⁶, Orange Juice⁷ and Juice⁸. Two queries are considered: Orange - Fruit and Orange - Juice. The results are presented in Figure 2.

Figure 2: Bell parameter for the query on the words Orange - Fruit and Orange - Juice in four different documents: Juice, Orange (Color), Orange (Fruit) and Orange Juice.

The query Orange - Fruit presents the first peak around $l = 22$ for the document Orange (Colour), the second at $l = 29$ for the document Orange (Fruit), then at $l = 40$ for the document Orange Juice and, very far away, for the document Juice. It is interesting to note that, even though the peaks are close, the peak of the curve for Orange (Colour) appears before the one for Orange (Fruit). This may suggest a strong correlation between the origin of the name of the color and the name of the fruit. The poor correlation of the general term Juice with the specific query Orange - Fruit is very clear on the graph according to this criterion.

Finally, the last query was Orange - Juice. Here again we recover precisely the order that we would expect for the documents: Orange Juice, Orange (Fruit), Juice and Orange (Colour). It is worth noticing that in the latter case the peak corresponds to a window length range considered to be optimal for the implementation of HAL ($l = 10$), which may indicate an even stronger correlation between the words.



6 Conclusion and Perspectives 

In this work, we presented a novel search experiment based on the Bell parameter extraction in 
semantic space using the HAL method. 

The semantic vectors in HAL are representations that are essentially measures of context. The HAL method has already been used for analogies with Quantum Theory by Bruza [23] for activating associations of concepts and by Wittek and Darányi [22] for extracting spectral content from the semantic space. HAL shows high potential because it is a simple way to build a semantic space with a measure that is independent of user judgment.

⁵ http://en.wikipedia.org/wiki/Orange_(colour) (accessed 12/04/2013)
⁶ http://en.wikipedia.org/wiki/Orange_(fruit) (accessed 12/04/2013)
⁷ http://en.wikipedia.org/wiki/Orange_juice (accessed 12/04/2013)
⁸ http://en.wikipedia.org/wiki/Juice (accessed 12/04/2013)

The main feature of Quantum Theory explored in this work is the violation of the Bell inequalities, which can be related to entanglement and non-locality, impossible at a classical level. The results show Bell inequality violation up to the maximal value $S_{Bell} = 2\sqrt{2}$ (Tsirelson's bound). In our model each document is associated to a two-dimensional Hilbert space (dependent on the search we are interested in), and queries are observables acting on it. A Bell parameter is then defined. We found that the Bell parameter is strongly dependent on the HAL window size. Our results suggest that for this kind of model there is an optimal window size that maximizes the Bell parameter. This is reminiscent of what was also noticed by Bruza and Woods [23]: if the window size is set too large, spurious co-occurrence associations are represented in the matrix and, if the window size is too small, relevant associations may be missed. In this model we see that too large windows may also dilute connections between associated words.

Only one document, the one that did not contain one of the words of the query, did not violate the Bell inequality. In general, an early appearance of the peak (at the smallest window sizes) seems to be related to the relevance of the document for the search. In the near future, other measures of quantum properties (such as proper measures of entanglement) will also serve to make a better comparison with the results issued from standard information retrieval methods. It is not yet clear how to interpret the Bell inequality violation here, nor what the meaning is of the optimal window length that maximizes the Bell parameter.

Can correlation and entanglement give a measure of query relevance? Experiments and systematic comparisons with other methods used in IR, such as Latent Semantic Analysis (LSA) and the ranking method Okapi BM25, could give further indications. An important technical point is that we introduced a new tool which has connections with Quantum Theory: query observables. Here we made a practical choice similar to the Pauli spin matrices, but we think that it should be possible, after much experimentation on different documents, to introduce new families of query observables adapted to different purposes and contexts. In the domain of IR many concepts are introduced to define, for example, opinion-like queries in social networks [24]. Efforts are also being made to diversify the results of ambiguous queries, considering concepts such as sentiment diversification to identify positive, negative and neutral sentiments about the search topic considered [25].
Appendix: Bell test HAL algorithm

The algorithm was implemented using the Python programming language along with the string module and pylab. All words are considered, and simple plurals (constructed by adding an s) are treated as singular words. Lower and upper case letters are not distinguished, which means that two words differing only in case are considered equivalent. Words sharing the same origin are nevertheless treated as different words (for example, battle and battling are distinct).
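The preprocessing rules above can be sketched as follows. This is not the original code: the function name `tokenize` is hypothetical, and the plural rule used here (a trailing s is dropped only when the shorter form also occurs in the document) is an assumed heuristic that can misfire on short words.

```python
import string

def tokenize(text):
    """Return (doc_list, dic_list): the cleaned word sequence, with repeats,
    and the sequence of distinct words in order of first appearance."""
    # Replace every punctuation mark by a space and ignore case.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    raw = text.lower().translate(table).split()
    seen = set(raw)
    # Assumed plural rule: "battles" -> "battle" only if "battle" also occurs.
    doc_list = [w[:-1] if w.endswith("s") and w[:-1] in seen else w for w in raw]
    dic_list = list(dict.fromkeys(doc_list))
    return doc_list, dic_list

doc_list, dic_list = tokenize("The battles: a Battle, battling!")
```

Here "battles" and "Battle" collapse to the same entry, while "battling" remains a distinct word, as described above.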



Input: document.

1. Construction of a "clean" (no punctuation marks) sequence of words, including any repeated words: Doc list.
2. Construction of the "dictionary", the sequence of non-repeated words: Dic list.
3. For the current window size l, construction of a primitive HAL matrix: for each word of the Doc list a window of length l is associated and all the scores of the words within it are collected in a matrix. The entry for each score is determined by the position of the words in the Dic list. The complete HAL matrix is obtained by summing this matrix with its transpose.
4. Normalization of each row vector. Determination of the state of the system by summing over all vectors and normalizing.
5. Calculation of the expected values of the defined operators and the Bell parameter.
6. New window size l + 1: return to step 3.

Figure 3: Flow diagram of the Quantum HAL algorithm described.